{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "version": "3.7.6",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "colab": {
      "name": "01 - Introducing txtai",
      "provenance": [],
      "collapsed_sections": []
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "POWZoSJR6XzK"
      },
      "source": [
        "# Introducing txtai\n",
        "\n",
        "[txtai](https://github.com/neuml/txtai) executes machine-learning workflows to transform data and build AI-powered semantic search applications. \n",
        "\n",
        "Traditional search systems use keywords to find data. Semantic search applications have an understanding of natural language and identify results that have the same meaning, not necessarily the same keywords.\n",
        "\n",
        "Backed by state-of-the-art machine learning models, data is transformed into vector representations for search (also known as embeddings). Innovation is happening at a rapid pace, models can understand concepts in documents, audio, images and more.\n",
        "\n",
        "The following is a summary of key features:\n",
        "\n",
        "- 🔎 Large-scale similarity search with multiple index backends ([Faiss](https://github.com/facebookresearch/faiss), [Annoy](https://github.com/spotify/annoy), [Hnswlib](https://github.com/nmslib/hnswlib))\n",
        "- 📄 Create embeddings for text snippets, documents, audio, images and video. Supports transformers and word vectors.\n",
        "- 💡 Machine-learning pipelines to run extractive question-answering, zero-shot labeling, transcription, translation, summarization and text extraction\n",
        "- ↪️️ Workflows that join pipelines together to aggregate business logic. txtai processes can be microservices or full-fledged indexing workflows.\n",
        "- 🔗 API bindings for [JavaScript](https://github.com/neuml/txtai.js), [Java](https://github.com/neuml/txtai.java), [Rust](https://github.com/neuml/txtai.rs) and [Go](https://github.com/neuml/txtai.go)\n",
        "- ☁️ Cloud-native architecture that scales out with container orchestration systems (e.g. Kubernetes)\n",
        "\n",
        "Applications range from similarity search to complex NLP-driven data extractions to generate structured databases. The following applications are powered by txtai.\n",
        "\n",
        "- [paperai](https://github.com/neuml/paperai) - AI-powered literature discovery and review engine for medical/scientific papers\n",
        "- [tldrstory](https://github.com/neuml/tldrstory) - AI-powered understanding of headlines and story text\n",
        "- [neuspo](https://neuspo.com) - Fact-driven, real-time sports event and news site\n",
        "- [codequestion](https://github.com/neuml/codequestion) - Ask coding questions directly from the terminal\n",
        "\n",
        "txtai is built with Python 3.6+, [Hugging Face Transformers](https://github.com/huggingface/transformers), [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) and [FastAPI](https://github.com/tiangolo/fastapi)\n",
        "\n",
        "This notebook gives an overview of txtai and how to run similarity searches."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qa_PPKVX6XzN"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
        "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
        "trusted": true,
        "_kg_hide-output": true,
        "id": "24q-1n5i6XzQ"
      },
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vHxb_jm16Xzd"
      },
      "source": [
        "# Create an Embeddings instance\n",
        "\n",
        "The Embeddings instance is the main entrypoint for txtai. An Embeddings instance defines the method used to tokenize and convert a text section into an embeddings vector. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "trusted": true,
        "id": "QxX9EtIc6Xzg"
      },
      "source": [
        "%%capture\n",
        "\n",
        "from txtai.embeddings import Embeddings\n",
        "\n",
        "# Create embeddings model, backed by sentence-transformers & transformers\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\"})"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JAO61bcy6Xzo"
      },
      "source": [
        "# Running similarity queries\n",
        "\n",
        "An embedding instance relies on the underlying transformer model to build text embeddings. The following example shows how to use an transformers Embedding instance to run similarity searches for a list of different concepts."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a",
        "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
        "trusted": true,
        "id": "2j_CFGDR6Xzp",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "7a9ecc24-94d1-4ab4-a197-196532a10474"
      },
      "source": [
        "data = [\"US tops 5 million confirmed virus cases\",\n",
        "        \"Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\",\n",
        "        \"Beijing mobilises invasion craft along coast as Taiwan tensions escalate\",\n",
        "        \"The National Park Service warns against sacrificing slower friends in a bear attack\",\n",
        "        \"Maine man wins $1M from $25 lottery ticket\",\n",
        "        \"Make huge profits without work, earn up to $100,000 a day\"]\n",
        "\n",
        "print(\"%-20s %s\" % (\"Query\", \"Best Match\"))\n",
        "print(\"-\" * 50)\n",
        "\n",
        "for query in (\"feel good story\", \"climate change\", \"public health story\", \"war\", \"wildlife\", \"asia\", \"lucky\", \"dishonest junk\"):\n",
        "    # Get index of best section that best matches query\n",
        "    uid = embeddings.similarity(query, data)[0][0]\n",
        "\n",
        "    print(\"%-20s %s\" % (query, data[uid]))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Query                Best Match\n",
            "--------------------------------------------------\n",
            "feel good story      Maine man wins $1M from $25 lottery ticket\n",
            "climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\n",
            "public health story  US tops 5 million confirmed virus cases\n",
            "war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate\n",
            "wildlife             The National Park Service warns against sacrificing slower friends in a bear attack\n",
            "asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate\n",
            "lucky                Maine man wins $1M from $25 lottery ticket\n",
            "dishonest junk       Make huge profits without work, earn up to $100,000 a day\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kIMbLW0t6Xzw"
      },
      "source": [
        "The example above shows for almost all of the queries, the actual text isn't stored in the list of text sections. This is the true power of transformer models over token based search. What you get out of the box is 🔥🔥🔥!"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DLIjSzbq6Xzx"
      },
      "source": [
        "# Building an Embeddings index\n",
        "\n",
        "For small lists of texts, the method above works. But for larger repositories of documents, it doesn't make sense to tokenize and convert to embeddings on each query. txtai supports building pre-computed indices which signficantly improve performance. \n",
        "\n",
        "Building on the previous example, the following example runs an index method to build and store the text embeddings. In this case, only the query is converted to an embeddings vector each search. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "trusted": true,
        "id": "cXfZtdHD6Xzy",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "df3d534e-74f5-484a-8f89-34611b809823"
      },
      "source": [
        "# Create an index for the list of text\n",
        "embeddings.index([(uid, text, None) for uid, text in enumerate(data)])\n",
        "\n",
        "print(\"%-20s %s\" % (\"Query\", \"Best Match\"))\n",
        "print(\"-\" * 50)\n",
        "\n",
        "# Run an embeddings search for each query\n",
        "for query in (\"feel good story\", \"climate change\", \"public health story\", \"war\", \"wildlife\", \"asia\", \"lucky\", \"dishonest junk\"):\n",
        "    # Extract uid of first result\n",
        "    # search result format: (uid, score)\n",
        "    uid = embeddings.search(query, 1)[0][0]\n",
        "\n",
        "    # Print text\n",
        "    print(\"%-20s %s\" % (query, data[uid]))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Query                Best Match\n",
            "--------------------------------------------------\n",
            "feel good story      Maine man wins $1M from $25 lottery ticket\n",
            "climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\n",
            "public health story  US tops 5 million confirmed virus cases\n",
            "war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate\n",
            "wildlife             The National Park Service warns against sacrificing slower friends in a bear attack\n",
            "asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate\n",
            "lucky                Maine man wins $1M from $25 lottery ticket\n",
            "dishonest junk       Make huge profits without work, earn up to $100,000 a day\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6TCVl6QA6Xz5"
      },
      "source": [
        "# Embeddings load/save\n",
        "\n",
        "Embeddings indices can be saved to disk and reloaded."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "trusted": true,
        "id": "5gyO90Hc6Xz7",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "1853e9ed-622c-4986-e653-04d7d4ab4e4c"
      },
      "source": [
        "embeddings.save(\"index\")\n",
        "\n",
        "embeddings = Embeddings()\n",
        "embeddings.load(\"index\")\n",
        "\n",
        "uid = embeddings.search(\"climate change\", 1)[0][0]\n",
        "print(data[uid])"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "trusted": true,
        "id": "d2uCotAE6X0J",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "45fc8372-d925-4f89-b1d3-0ae895ed74a8"
      },
      "source": [
        "!ls index"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "config\tembeddings\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6m7sYUj_AdOL"
      },
      "source": [
        "# Embeddings update/delete\n",
        "\n",
        "Updates and deletes are supported for Embedding indices. The upsert operation will insert new data and update existing data\n",
        "\n",
        "The following section runs a query, then updates a value changing the top result and finally deletes the updated value to revert back to the original query results."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "2CERR0U2Ac8C",
        "outputId": "796d3e8b-557a-4140-ccfb-6a1a3c3c9474"
      },
      "source": [
        "# Run initial query\n",
        "uid = embeddings.search(\"feel good story\", 1)[0][0]\n",
        "print(\"Initial: \", data[uid])\n",
        "\n",
        "# Update data\n",
        "data[0] = \"See it: baby panda born\"\n",
        "embeddings.upsert([(0, data[0], None)])\n",
        "\n",
        "uid = embeddings.search(\"feel good story\", 1)[0][0]\n",
        "print(\"After update: \", data[uid])\n",
        "\n",
        "# Remove record just added from index\n",
        "embeddings.delete([0])\n",
        "\n",
        "# Ensure value matches previous value\n",
        "uid = embeddings.search(\"feel good story\", 1)[0][0]\n",
        "print(\"After delete: \", data[uid])"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Initial:  Maine man wins $1M from $25 lottery ticket\n",
            "After update:  See it: baby panda born\n",
            "After delete:  Maine man wins $1M from $25 lottery ticket\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aDIF3tYt6X0O"
      },
      "source": [
        "# Embedding methods\n",
        "\n",
        "Embeddings supports two methods for creating text vectors, the sentence-transformers library and word embeddings vectors. Both methods have their merits as shown below:\n",
        "\n",
        "- [sentence-transformers](https://github.com/UKPLab/sentence-transformers)\n",
        "  - Creates a single embeddings vector via mean pooling of vectors generated by the transformers library. \n",
        "  - Supports models stored on Hugging Face's model hub or stored locally. \n",
        "  - See sentence-transformers for details on how to create custom models, which can be kept local or uploaded to Hugging Face's model hub.\n",
        "  - Base models require significant compute capability (GPU preferred). Possible to build smaller/lighter weight models that tradeoff accuracy for speed.\n",
        "- word embeddings\n",
        "  - Creates a single embeddings vector via BM25 scoring of each word component. See this [Medium article](https://towardsdatascience.com/building-a-sentence-embedding-index-with-fasttext-and-bm25-f07e7148d240) for the logic behind this method.\n",
        "  - Backed by the [pymagnitude](https://github.com/plasticityai/magnitude) library. Pre-trained word vectors can be installed from the referenced link.\n",
        "  - See [words.py](https://github.com/neuml/txtai/blob/master/src/python/txtai/vectors/words.py) for code that can build word vectors for custom datasets.\n",
        "  - Significantly better performance with default models. For larger datasets, it offers a good tradeoff of speed and accuracy."
      ]
    }
  ]
}